As the distributions of animals, plants and fungi are influenced by biotic and abiotic processes acting across a range of spatial scales, it is essential to consider both broad- and fine-scale factors when assessing a translocation recipient site. At broad scales, distributions are primarily shaped by dominant ecosystem processes and physiological tolerances to abiotic factors such as climate, soils, geology and hydrology. The spatial and temporal heterogeneity of these abiotic characteristics contributes to differences in population performance and persistence across species’ ranges. Therefore, determining suitable locations for assisted colonisation necessitates an understanding of the focal species’ abiotic niche.
The most widely recognised and commonly applied statistical approach for quantifying a species’ niche space and mapping it geographically is known as ‘ecological niche modelling’. Depending on the intended application, this method is also termed ‘species distribution modelling’ (when modelling current distributions) or ‘habitat suitability modelling’ (when modelling the suitability of habitats). Although more process-based (‘mechanistic’) approaches also exist (Kearney & Porter, 2009), all of these models employ mathematical algorithms to quantify relationships between species occurrences and environmental variables. Once trained on these relationships, an ecological niche model (ENM) can predict the suitability of sites within a given region, such as when evaluating potential assisted colonisation recipient sites.
In the Guidelines for Reintroductions and Other Conservation Translocations, the International Union for Conservation of Nature (IUCN) suggests that the climate requirements of the focal species be “matched to current and/or future climate at the destination site”. After training an ENM on relationships between species occurrences and recent climatic conditions, models can be projected onto future climate scenarios from the Intergovernmental Panel on Climate Change (IPCC) to predict patterns of suitability change in the future. This functionality makes ENMs especially applicable to assisted colonisation projects, where species are moved in response to existing and anticipated future threats within the indigenous range. However, projecting models to future scenarios comes with several potential sources of uncertainty, such as predicting into non-analogue climates and selecting among different climate change models and scenarios, both of which can significantly influence site selection decisions. Despite these challenges, it can be argued that the biological and financial risks of disregarding climate change impacts outweigh those associated with acting on model projections, provided these uncertainties are carefully managed and clearly communicated.
In the tutorial presented here, we introduce the main steps for building and evaluating ENMs for use in recipient site assessment and selection. The primary goal is not to present a detailed set of guidelines on how to construct an ENM (there are already plenty of excellent resources designed for this purpose; see Guisan et al., 2017; Araújo et al., 2019; Zurell et al., 2020; Sillero et al., 2021); rather, it is to demonstrate how to build and project these models for use in selecting potential sites for assisted colonisation. It is important to emphasise that ENMs should not be viewed as a substitute for field-based assessments; instead, they complement fine-scale assessments by offering an additional layer of insight on a location’s suitability for translocation. Due to data constraints, ENMs can often only be implemented at relatively coarse spatial resolutions (e.g., 1-km cells), which may not capture the relevant suite of microhabitat and microclimatic conditions present at a site. Consequently, site-based, fine-scale assessments are likely to be more informative for understanding a species’ precise environmental affinities. Nevertheless, ENMs provide a valuable broad-scale perspective on the environmental conditions a species encounters and serve as a tool for projecting suitable macroclimate into the future.
The ENM workflow can be broken down into five main steps: (i) conceptualisation, (ii) data preparation, (iii) model fitting, (iv) model assessment, and (v) prediction (Zurell et al., 2020). The tutorial presented here follows this standardised approach to model building and analysis. As emphasised by Zurell and colleagues (2020), model building is an iterative process with opportunities for learning and knowledge-building along the way. Consequently, it is beneficial to revisit and refine certain steps, such as the choice of modelling extent or the presence/absence sampling design, to improve the model. For this reason, it can be helpful to view model building as a continuous cycle of refinement and improvement, rather than a linear workflow with a fixed endpoint.
There are many repositories and potential sources of species occurrence data. The most taxonomically and geographically comprehensive is GBIF, the Global Biodiversity Information Facility. For UK data, the National Biodiversity Network (NBN) Atlas also warrants consideration. Occurrence data from these repositories can be heavily biased and unreliable due to factors such as uneven survey effort, coordinate imprecision and uncertainty, presence of cultivated specimens, data collection practices and national policies on data digitisation and sharing (Beck et al., 2014). If left unaddressed, these biases can distort ENMs, leading to misleading conclusions about species ranges and climatic niches. For example, GBIF data may overrepresent species occurrences in well-funded, data-sharing regions while underrepresenting regions with higher actual species densities. Similarly, some data partners choose to add uncertainty to otherwise precise occurrence records of threatened species, which are the most likely to be targeted for an assisted colonisation attempt. Nevertheless, it is possible to build robust and reliable ENMs fit for downstream conservation decisions using occurrence data from these repositories, provided that biases and uncertainties are carefully addressed.
Araújo, M. B., R. P. Anderson, A. M. Barbosa, C. M. Beale, C. F. Dormann, R. Early, R. A. Garcia, A. Guisan, L. Maiorano, B. Naimi, R. B. O’Hara, N. E. Zimmermann, and C. Rahbek. 2019. Standards for distribution models in biodiversity assessments. Science Advances 5, 1–12.
Beck, J., Böller, M., Erhardt, A. and Schwanghart, W., 2014. Spatial bias in the GBIF database and its effect on modeling species’ geographic distributions. Ecological Informatics, 19, 10–15.
Guisan, A., Thuiller, W. and Zimmermann, N.E., 2017. Habitat suitability and distribution models: with applications in R. Cambridge University Press.
Kearney, M. and Porter, W., 2009. Mechanistic niche modelling: combining physiological and spatial data to predict species’ ranges. Ecology letters, 12(4), pp.334-350.
Sillero, N., S. Arenas-Castro, U. Enriquez-Urzelai, C. G. Vale, D. Sousa-Guedes, F. Martínez-Freiría, R. Real, and A. M. Barbosa. 2021. Want to model a species niche? A step-by-step guideline on correlative ecological niche modelling. Ecological Modelling 456, e109671.
Zurell, D., J. Franklin, C. König, P. J. Bouchet, C. F. Dormann, J. Elith, G. Fandos, X. Feng, G. Guillera-Arroita, A. Guisan, J. J. Lahoz-Monfort, P. J. Leitão, D. S. Park, A. T. Peterson, G. Rapacciuolo, D. R. Schmatz, B. Schröder, J. M. Serra-Diaz, W. Thuiller, … C. Merow. 2020. A standard protocol for reporting species distribution models. Ecography 43, 1261–1277.
This section assumes the user has already obtained an occurrence dataset for their focal species. In the code below, we use an occurrence dataset for the Fenn’s wainscot moth Protarchanara brevilinea, downloaded from GBIF as a Darwin Core archive (Darwin Core is a data standard for publishing and integrating biodiversity information). It is also possible to download GBIF data directly in R through the rgbif package, which wraps R code around the GBIF API (see this tutorial).
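For illustration, a minimal rgbif-based download might look like the sketch below (exploratory use only; the fields returned differ from those in a Darwin Core archive download, and large or citable downloads should instead use rgbif's occ_download() with a registered GBIF account).
library(rgbif)
# Look up the GBIF backbone taxon key for the focal species
key <- name_backbone(name = "Protarchanara brevilinea")$usageKey
# Retrieve up to 2,000 georeferenced records (occ_search() is convenient for exploration
# but is capped at around 100,000 records and does not generate a citable download DOI)
occ_raw <- occ_search(taxonKey = key, hasCoordinate = TRUE, limit = 2000)
head(occ_raw$data)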
The aim here is to model the future climate suitability for the Fenn’s Wainscot Moth Protarchanara brevilinea in the United Kingdom and identify potential sites for assisted colonisation. This is a nationally rare Biodiversity Action Plan species that is largely restricted to the Norfolk Broads in the UK. The larvae feed on Common Reed Phragmites australis, initially in the stems and later on the leaves, over-wintering as an egg. Habitat quantity and quality are decreasing due to increased reed-bed management, and the species is threatened by sea level rise (Fox et al., 2019).
Figure 1. Fenn’s Wainscot Moth Protarchanara brevilinea
First, we will load the required packages, set the working directory, read in our raw GBIF data and select the fields we are most interested in.
# Load the required packages
library(CoordinateCleaner)
library(dplyr)
library(ggplot2)
library(leaflet)
library(mapview)
library(sf)
setwd("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\")
raw_occ <- read.delim("./Protarchanara_brevilinea/occurrence.txt",
header = TRUE)
# Be aware of access rights and the various licences. For example, a "CC_BY_NC_4_0" licence does not permit the use of data for
# commercial gain
table(raw_occ$license)
##
## CC_BY_4_0
## 498
## CC_BY_NC_4_0
## 1232
## CC0_1_0
## 549
## https://creativecommons.org/licenses/by-nc/4.0/legalcode
## 566
## https://creativecommons.org/licenses/by/4.0/legalcode
## 5
## https://creativecommons.org/publicdomain/zero/1.0/legalcode
## 7
# Only retain records of species presence, "PRESENT"
raw_occ <- raw_occ[raw_occ$occurrenceStatus == "PRESENT", ]
# Select columns of interest
GBIF_Dat <- raw_occ %>%
dplyr::select(species, family, gbifID, occurrenceStatus, decimalLongitude, decimalLatitude,
coordinateUncertaintyInMeters, coordinatePrecision, countryCode, stateProvince, year, eventDate,
institutionCode, datasetName, occurrenceID, scientificName, taxonRank,
basisOfRecord, hasGeospatialIssues, issue,
speciesKey, taxonKey)
## Let's first remove rows without coordinates
GBIF_Dat_geo <- GBIF_Dat[!(is.na(GBIF_Dat$decimalLatitude) | is.na(GBIF_Dat$decimalLongitude)), ]
## View the occurrences on an interactive map with the mapview package
GBIF_sf <- st_as_sf(GBIF_Dat_geo, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326) # Convert GBIF data to a spatial points object
mapview(GBIF_sf, col.regions = "darkred", cex = 2) # View the occurrences on an interactive map
# Or plot the occurrences using ggplot2
worldmap <- borders("world", colour = "gray50", fill = "gray50")
ggplot() +
coord_fixed() +
worldmap +
geom_point(data = GBIF_Dat_geo,
aes(x = decimalLongitude, y = decimalLatitude),
colour = "darkred",
size = 0.5) +
theme_bw() +
ggtitle("Raw GBIF occurrence data for Protarchanara brevilinea")
## Remove historical records/records collected prior to the period covered by the climate data
# Optional step: set aside rows with a blank year (otherwise they would be dropped by the next step)
Blank_years <- GBIF_Dat_geo[is.na(GBIF_Dat_geo$year),]
GBIF_Dat_geo_post1980 <- GBIF_Dat_geo[!is.na(GBIF_Dat_geo$year) & GBIF_Dat_geo$year >= 1980,]
GBIF_Dat_geo_post1980 <- rbind(Blank_years, GBIF_Dat_geo_post1980) # Combine the blank rows df with the post-1980 df
## Remove records with a coordinate uncertainty greater than a particular threshold. !!Note we also remove records where no coordinate uncertainty was reported in this step - decide whether you should do the same!!
# A commonly observed rule-of-thumb is to remove records with an uncertainty greater than the resolution of the environmental data
GBIF_Dat_geo_post1980_1km <- GBIF_Dat_geo_post1980[!is.na(GBIF_Dat_geo_post1980$coordinateUncertaintyInMeters) &
                                                   GBIF_Dat_geo_post1980$coordinateUncertaintyInMeters <= 1000,]
## Remove coordinates reported to a precision of fewer than 2 decimal places (i.e. coarser than ~1 km); note that this check is applied to latitude only
GBIF_Dat_geo_post1980_1km <- GBIF_Dat_geo_post1980_1km[grep("\\.[0-9][0-9]", GBIF_Dat_geo_post1980_1km$decimalLatitude), ]
## Now let's have another look at the data in mapview
## View the occurrences on an interactive map with the mapview package
GBIF_Dat_geo_post1980_1km_sf <- st_as_sf(GBIF_Dat_geo_post1980_1km, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326) # Convert GBIF data to a spatial points object
mapview(GBIF_Dat_geo_post1980_1km_sf, col.regions = "darkred", cex = 2) # View the occurrences on an interactive map
## Some additional checks
table(GBIF_Dat_geo_post1980_1km$coordinatePrecision) # Check GBIF's own precision results
## < table of extent 0 >
table(GBIF_Dat_geo_post1980_1km$basisOfRecord) # Check for unsuitable data sources (e.g. fossilised or machine-based)
##
## HUMAN_OBSERVATION PRESERVED_SPECIMEN
## 321 21
table(GBIF_Dat_geo_post1980_1km$hasGeospatialIssues) # Check for GBIF's geospatial flags
##
## false
## 342
#table(GBIF_Dat_geo_post1980_1km$issue) # Check for other spatial and taxonomic issues
Next, we are going to use the CoordinateCleaner package to flag problems with the remaining records. This package was created specifically to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardised and reproducible way. You can read more about the CoordinateCleaner package in Zizka et al. (2019) published in Methods in Ecology and Evolution.
# First run some generic checks using CoordinateCleaner
flags <- clean_coordinates(x = GBIF_Dat_geo_post1980_1km,
lon = "decimalLongitude",
lat = "decimalLatitude",
countries = "countryCode",
species = "species",
tests = c("centroids", # tests a radius around country centroids.
"equal", # tests for equal absolute longitude and latitude.
"gbif", # tests a one-degree radius around the GBIF headquarters in Copenhagen
"seas", # tests if coordinates fall into the ocean.
"zeros" # tests for plain zeros, equal latitude and longitude and a radius around the point 0/0.
))
## Reading layer `ne_50m_land' from data source
## `C:\Users\Joe\AppData\Local\Temp\RtmpYJEBIC\ne_50m_land.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 1420 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
## Geodetic CRS: WGS 84
# identify AND **remove** spatial duplicates
GBIF_Dat_geo_post1980_1km_nd <- cc_dupl(GBIF_Dat_geo_post1980_1km, lon = "decimalLongitude", lat = "decimalLatitude", species = "species",
value = "clean")
# identify records in the vicinity (I chose 1km) of biodiversity institutions
cc_inst_vect <- cc_inst(GBIF_Dat_geo_post1980_1km, lon = "decimalLongitude", lat = "decimalLatitude", species = "species",
value = "flagged", buffer = 1000)
## Other potentially useful CoordinateCleaner flagging functions
# outlier detection - THIS FUNCTION SHOULD BE USED WITH GREAT CAUTION because it represents a mathematical approach and thus overlooks the plethora of potential
# reasons why a record is an outlier, e.g., isolated populations, data-poor region, introduced population, etc.
#cc_outl_vect <- cc_outl(GBIF_Dat_geo_post1980_1km, lon = "decimalLongitude", lat = "decimalLatitude", species = "species",
# value = "flagged")
# occurrences in urban areas
#cc_urb_vect <- cc_urb(GBIF_Dat_geo_post1980_1km, lon = "decimalLongitude", lat = "decimalLatitude",
#                      value = "flagged") # identify records in urban areas (a reference urban-area layer is downloaded by default)
## Check for rasterized sampling (common in GBIF due to submission of atlas data)
# These records might have a low precision and might therefore be problematic for some analyses. For instance, presences derived from a
# 1-degree grid of a national atlas might be too coarse for small-scale species distribution models.
cc_round <- cd_round(GBIF_Dat_geo_post1980_1km_nd,
lon = "decimalLongitude",
lat = "decimalLatitude",
ds = "species",
value = "flagged",
T1 = 7,
graphs = F)
table(cc_round) # For Protarchanara brevilinea, no rasterized data was detected (yay!)
## cc_round
## TRUE
## 151
Now let’s visually assess the remaining records to identify and remove occurrences outside the known range or in areas where the distribution is poorly documented. Note that for P. brevilinea, there are a couple of very isolated records in Asia, which we are going to remove for two reasons: (1) the distribution is poorly documented in this region; and (2) inclusion of these records would greatly increase the modelling extent and artificially inflate the discrimination evaluation metric, the area under the curve (AUC) (see VanDerWal et al., 2009).
# Convert GBIF data to a spatial points object and view on an interactive map
GBIF_Dat_geo_post1980_1km_nd_sf <- st_as_sf(GBIF_Dat_geo_post1980_1km_nd, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
mapview(GBIF_Dat_geo_post1980_1km_nd_sf, col.regions = "darkred", cex = 2) # View the occurrences on an interactive map
# For occurrences in the Baltic region, we will use this publication as a reference:
# Nowacki, J. and Wasala, R., 2015. Protarchanara brevilinea FENN, 1864-a moth species new to Poland (Lepidoptera: Noctuidae).
# Polish Journal of Entomology, 84(1), p.3.
# For France:
# Barbut, J. and Lévêque, A., 2018. Protarchanara brevilinea (Fenn, 1864): redécouverte de l'espèce en France, dans le département
# de la Gironde (Lepidoptera Noctuidae Xyleninae Apameini). Alexanor, 28(5), pp.407-430.
# For the UK, the contributing institutions were checked and reliability was assumed based on their authority
# Remove records from Asia based on gbifID
GBIF_Dat_geo_post1980_1km_nd_sf <- GBIF_Dat_geo_post1980_1km_nd_sf %>%
filter(!gbifID %in% c(4923709395, 4910981608, 3912272783))
mapview(GBIF_Dat_geo_post1980_1km_nd_sf, col.regions = "darkred", cex = 2) # View the occurrences on an interactive map
#write.csv(GBIF_Dat_geo_post1980_1km_nd_sf, "Protarchanara_brevilinea/P_brevilinea_cleanOcc.csv")
Fox, R., Parsons, M.S. and Harrower, C.A., 2019. A review of the status of the macro-moths of Great Britain. Dorset, UK: Butterfly Conservation.
VanDerWal, J., Shoo, L.P., Graham, C. and Williams, S.E., 2009. Selecting pseudo-absence data for presence-only distribution modeling: how far should you stray from what you know?. Ecological modelling 220, 589-594.
Zizka, A., Silvestro, D., Andermann, T., Azevedo, J., Duarte Ritter, C., Edler, D., Farooq, H., Herdean, A., Ariza, M., Scharn, R. and Svantesson, S., 2019. CoordinateCleaner: Standardized cleaning of occurrence records from biological collection databases. Methods in Ecology and Evolution 10, 744-751.
In this section, we will focus on downloading and preparing the climate data for modelling. We will use climate data from the WorldClim platform, which provides global climate data at a range of resolutions for both historic and future time periods. Another valuable resource is CHELSA, which provides additional variables and extended temporal coverage, and may be worth exploring depending on your needs. For UK-only climate data, the Met Office has released a variety of high-quality datasets available through the CEDA archive.
library(ENMeval)
library(geodata)
library(mapview)
library(rJava)
library(rmapshaper)
library(SDMtune)
library(sf)
library(stars)
library(terra)
We can use the geodata package to access climate data from the WorldClim platform (Version 2.1) directly in R. We are going to download data at a 5 arc-minute resolution for a baseline period (1970-2000) and a future climate projection. Future climate projections are available for the latest Coupled Model Intercomparison Project Phase 6 (CMIP6) projections for a range of general circulation models (GCMs), shared socio-economic pathways (SSPs) and time periods. Be aware that not all combinations of GCM-SSP-Time are available for download and it is worth checking the WorldClim website before attempting to download the climate data in R.
!! IMPORTANT !! In this tutorial, we use a single climate change scenario to project future habitat suitability. While this approach is acceptable for demonstration purposes, it would not be appropriate in a real-world conservation planning context. Relying on a single scenario overlooks the substantial uncertainty involved in projecting future climate conditions. To better capture this uncertainty, we recommend using multiple general circulation models (GCMs) across several time periods, ideally within one of the more plausible emissions pathways, such as SSP2-4.5 (Hausfather et al., 2022). A general rule is to select a minimum of five GCMs with low interdependence in order to represent a substantial proportion of the uncertainty among climate model projections (e.g., hottest, driest, wettest scenarios). At the time of writing, the CHELSA climate data portal provides access to data for five GCMs (GFDL-ESM4, IPSL-CM6A-LR, MPI-ESM1-2-HR, MRI-ESM2-0, and UKESM1-0-LL), which were selected by the Inter-Sectoral Impact Model Intercomparison Project 3b (ISIMIP3b) due to their structural independence in terms of ocean and atmosphere model components (Lange 2021).
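To illustrate, a download loop over several GCMs for the same SSP and period might look like the sketch below (assuming these GCM-SSP-time combinations are available from WorldClim at your chosen resolution; check the WorldClim website before running).
# Sketch: download the same SSP2-4.5, 2041-2060 bioclim layers for several GCMs with geodata
# (availability varies by GCM, SSP and resolution - check WorldClim first)
gcms <- c("GFDL-ESM4", "IPSL-CM6A-LR", "MPI-ESM1-2-HR", "MRI-ESM2-0", "UKESM1-0-LL")
for (gcm in gcms) {
  cmip6_world(model = gcm,
              ssp = "245",
              time = "2041-2060",
              var = "bioc",
              res = 5,
              path = file.path("./ClimVars/Future", gcm))
}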
# Use the geodata package to download baseline climate data from WorldClim
setwd("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\")
#worldclim_global(var = "bio", # Bioclimatic variables 1-19
# res = 5,
# path = "./ClimVars/wc2.1_5m_baseline")
# Use the geodata package to download future CMIP6 climate data from WorldClim
#cmip6_world(model = "HadGEM3-GC31-LL",
# ssp = "245", # This is a medium emissions scenario
# time = "2041-2060", # Mid-century projections
# var = "bioc",
# res = 5,
# path = "./ClimVars/Future")
## Load the current/baseline climate data
climate_path <- "C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\wc2.1_5m_baseline\\"
climate_tifs <- list.files(climate_path, pattern = "\\.tif$", full.names = TRUE) # List all .tif files in the directory
climate_list <- lapply(climate_tifs, function(x) { rast(x) }) # Load each .tif file as a 'rast' object and store it in a list
# Stack the list of SpatRaster elements into a single object with 19 layers
ClimVars <- rast(climate_list)
# Rename
names(ClimVars) <- gsub("wc2.1_5m_", "", names(ClimVars))
names(ClimVars)
## [1] "bio_1" "bio_10" "bio_11" "bio_12" "bio_13" "bio_14" "bio_15" "bio_16"
## [9] "bio_17" "bio_18" "bio_19" "bio_2" "bio_3" "bio_4" "bio_5" "bio_6"
## [17] "bio_7" "bio_8" "bio_9"
# We will prepare the future climate data later
It is generally advisable to include range-wide occurrences in ENMs because this approach accounts for the full spectrum of environmental variability experienced by the species (Sánchez-Fernández et al. 2011; Raes 2012). There may be some support for partial modelling if your study region harbours conditions that are not represented elsewhere in the range and/or if natural populations within the study region are locally adapted (Hällfors et al. 2016; Smith et al. 2019). However, in the case of assisted colonisation, where we argue that potential climate change impacts should always be considered, models based on more localised extents potentially have reduced transferability to future climate periods. This is because they are more likely to encounter novel climate conditions in the projected variables, requiring extrapolation to generate suitability values.
A common approach to delimiting the geographical modelling extent is to extract ecoregions that intersect with the species presence records, as this method reduces the potential for artificial inflation of discrimination metrics which can occur when background points or pseudo-absences are selected from too large an extent (VanDerWal et al., 2009; Acevedo et al., 2012). The critical assumption of this approach is that the ecoregions where the species occurs represent the areas that have been accessible to it over time periods relevant to the analysis (Barve et al., 2012; Sillero et al., 2021). The WWF Terrestrial Ecoregions is a popular resource used for this purpose (you can download the WWF Terrestrial Ecoregions shapefile here). The occupied terrestrial ecoregion polygons can then be used to mask the selected environmental variables in preparation for modelling.
## Load in the wwf ecoregion data
wwf_eco <- st_read("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\wwf_eco\\wwf_terr_ecos.shp")
## Reading layer `wwf_terr_ecos' from data source
## `C:\Users\Joe\OneDrive - Liverpool John Moores University\ESRT JB Personal\wwf_eco\wwf_terr_ecos.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 14458 features and 21 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -180 ymin: -89.89197 xmax: 180 ymax: 83.62313
## Geodetic CRS: WGS 84
wwf_eco_v <- st_make_valid(wwf_eco) # Make valid
## Load in cleaned occurrence data (previously named GBIF_Dat_geo_post1980_1km_nd)
#cleanedOccs <- read.csv("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB #Personal\\Protarchanara_brevilinea\\P_brevilinea_cleanOcc.csv")
# Coerce to sf data frame
#cleanedOccs_sf <- st_as_sf(cleanedOccs, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326)
# Rename if following directly on from the "Preparing Occurrence Data" step
cleanedOccs_sf <- GBIF_Dat_geo_post1980_1km_nd_sf
## Extract occupied ecoregions
joined_data <- st_join(cleanedOccs_sf, wwf_eco_v) #extract ecoregions overlapping with cleaned occurrences
L3_ecoregions <- as.character(joined_data$ECO_NAME)
L3_ecoregions <- unique(L3_ecoregions)
occupiedEcoregions<- subset(wwf_eco_v, wwf_eco_v$ECO_NAME %in% L3_ecoregions)#polygons of occupied ecoregions
mapview(occupiedEcoregions) + mapview(cleanedOccs_sf)
## Create a buffer around occurrence points
buffered_area <- st_buffer(cleanedOccs_sf, dist = 300000) # Choose dist carefully (we chose 300km) by considering species range size and dispersal
buff_union <- st_make_valid(st_union(buffered_area)) # unify/dissolve the buffered point polygons and ensure validity
## Intersect the ecoregions and buffered points objects
clipped_ecoregions <- st_intersection(occupiedEcoregions, buff_union)
mapview(clipped_ecoregions)
## Mask the climate data by the occupied ecoregions
ClimVars_cr <- terra::crop(ClimVars, clipped_ecoregions) # Crop first
ClimVars_m <- terra::mask(ClimVars_cr, clipped_ecoregions) # Then mask
To fit the statistical models, absences or background points are required in addition to the presence records. Since true absence records require reliable ground-survey data to prove that a species is not present within a grid cell, artificially generated pseudo-absences are often used as surrogates of true absences. There are a range of approaches that have been proposed for pseudo-absence selection and we encourage you to think carefully about this step, as the method and number of pseudo-absence points can strongly influence model predictions (see Barbet-Massin et al., 2012). Some frequently applied approaches include random sampling, stratified sampling (environmentally or spatially) and target-group sampling.
Here, we are going to use the simplest approach: random selection from unoccupied raster cells within the occupied ecoregion area. We will select 10,000 records at random as this number has been shown to work well with the modelling algorithms we intend to use: generalised linear models (GLM), generalised additive models (GAM) and MaxEnt. Note that for classification algorithms such as random forest (RF), using the same number of pseudo-absences as available presences may produce more reliable outputs (a commented sketch of this alternative follows the sampling code below).
## Sample pseudo-absences at random from unoccupied raster cells
ClimVars_masked <- mask(ClimVars_m, vect(cleanedOccs_sf), inverse = TRUE) # Mask presence locations in ClimVars_m
set.seed(42) # For reproducibility
pseudoAbs <- spatSample(ClimVars_masked, size = 10000, method = "random", na.rm = TRUE, xy = TRUE) # Sample 10,000 random points, avoiding presence locations
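## Note (sketch): for a classification algorithm such as random forest, you could instead
## draw as many pseudo-absences as there are presence records, e.g.:
# pseudoAbs_rf <- spatSample(ClimVars_masked, size = nrow(cleanedOccs_sf),
#                            method = "random", na.rm = TRUE, xy = TRUE)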
## Refine presence records to 1 per cell
occ_raster <- rasterize(vect(cleanedOccs_sf), ClimVars_m, field = 1, background = 0, touches = TRUE) # cells containing at least one presence get a value of 1
cell_coords <- crds(occ_raster, df = TRUE, na.rm = TRUE)
cell_values <- values(occ_raster, na.rm = TRUE)
pres_coords <- cell_coords[as.vector(cell_values == 1), ]
## Identify and remove presence records outside of area covered by climate data
Presences_vals <- terra::extract(ClimVars_m, pres_coords)
na_rows <- which(apply(Presences_vals, 1, function(x) any(is.na(x))))
if (length(na_rows) > 0) pres_coords <- pres_coords[-na_rows, ] # only drop rows when NA values were actually found
## Format presences and pseudo-absences for modelling
pres_coords$Species <- 'P_brevilinea'
pres_coords$Binary <- 1
pres_coords$PA1 <- TRUE
pseudoAbs_coords <- data.frame(x = pseudoAbs$x,
y = pseudoAbs$y,
Species = 'P_brevilinea',
Binary = NA,
PA1 = TRUE)
PATable <- rbind(pres_coords, pseudoAbs_coords)
#write.csv(PATable, "./PATable_CV/PATable.csv")
To build and subsequently validate the models, we are going to adopt a block cross-validation approach, whereby the presence and pseudo-absence data are split into four geographically non-overlapping bins (Muscarella et al., 2014; Roberts et al., 2017), corresponding (as closely as possible) to each corner of the geographical modelling extent. Four separate models will be built for each algorithm, each trained on three blocks and tested on the remaining block, until every block has been used once for testing. This spatial blocking approach to model validation offers a more reliable assessment than randomly splitting the data for cross-validation, particularly when the goal is to make inferences beyond the training dataset, such as projecting models under climate change (Wenger and Olden, 2012; Roberts et al., 2017).
## Use a data-driven approach to select predictor variables for the model
pres_df <- data.frame(x = pres_coords$x, y = pres_coords$y)
abs_df <- data.frame(x = pseudoAbs_coords$x, y = pseudoAbs_coords$y)
## First we need to partition the presence and pseudo-absence data into spatial blocks using the ENMeval package. Another package, blockCV, has also been
# created specifically for this purpose and offers more flexibility in approach
block_partitions <- get.block(occs = pres_df, bg = abs_df, orientation = "lat_lon")
# Convert `occs.grp` and `bg.grp` to a data frame
occs_df <- data.frame(
X_PA1_RUN1 = block_partitions$occs.grp == 1,
X_PA1_RUN2 = block_partitions$occs.grp == 2,
X_PA1_RUN3 = block_partitions$occs.grp == 3,
X_PA1_RUN4 = block_partitions$occs.grp == 4
)
bg_df <- data.frame(
X_PA1_RUN1 = block_partitions$bg.grp == 1,
X_PA1_RUN2 = block_partitions$bg.grp == 2,
X_PA1_RUN3 = block_partitions$bg.grp == 3,
X_PA1_RUN4 = block_partitions$bg.grp == 4
)
cv_blocks <- rbind(occs_df, bg_df)
#write.csv(cv_blocks, "./PATable_CV/cv_blocks.csv")
# Some code to check the output
check_blocks <- cbind(PATable, cv_blocks)
check_blocks_sf <- st_as_sf(check_blocks, coords = c("x", "y"), crs = 4326)
block1 <- check_blocks_sf[check_blocks_sf$X_PA1_RUN1 == TRUE, ]
block2 <- check_blocks_sf[check_blocks_sf$X_PA1_RUN2 == TRUE, ]
block3 <- check_blocks_sf[check_blocks_sf$X_PA1_RUN3 == TRUE, ]
block4 <- check_blocks_sf[check_blocks_sf$X_PA1_RUN4 == TRUE, ]
mapview(block1, color = "blue") + mapview(block2, color = "red") + mapview(block3, color = "orange") + mapview(block4, color = "green")
First, it is important to stress that using expert knowledge to select candidate predictor variables for inclusion in ENMs represents one of the best approaches to maximise the biological significance of the models. However, experts may be biased when the species occurs outside of the geographical area with which they are familiar, resulting in variables that reflect only a subset of the environmental conditions experienced by the species (Brandt et al. 2017).
A data-driven approach can also be used to select predictor variables, and this too has advantages and disadvantages: it scales well to large numbers of candidate variables and occurrence records, but the resulting variables may be biologically implausible or irrelevant (Brandt et al. 2017).
Here, we demonstrate a data-driven approach that selects climatic variables from bio1-19 using the varSel function from the R package SDMtune (Vignali et al., 2020). This approach considers both variable importance and multicollinearity during variable selection by removing collinear variables with the lowest model performance based on a chosen evaluation metric (using the same four calibration/validation datasets used in the other models). The selection process ends when all remaining variables have pairwise correlation coefficients below the chosen correlation threshold, commonly set at |r| = 0.7 (Dormann et al., 2013).
# Rename for SDMtune
#colnames(cv_blocks) <- c("_PA1_RUN1", "_PA1_RUN2", "_PA1_RUN3", "_PA1_RUN4")
#training_matrix <- as.matrix(cv_blocks[, c("_PA1_RUN1", "_PA1_RUN2", "_PA1_RUN3", "_PA1_RUN4")]) # Create the training matrix: each column represents one fold
#testing_matrix <- !training_matrix # Create the testing matrix by inverting the training matrix
#folds_list <- list(train = training_matrix, test = testing_matrix) # Combine the training and testing matrices into a list
# Prepare SWD object required by SDMtune
#Presences <- PATable[!is.na(PATable$Binary) & PATable$Binary == 1, c("x", "y")]
#Absences <- PATable[is.na(PATable$Binary), c("x", "y")]
#swd_ob <- prepareSWD(
# "P_brevilinea",
# ClimVars_m,
# p = Presences,
# a = Absences,
# categorical = NULL,
# verbose = TRUE
#)
# Plot correlation matrix
#plotCor(swd_ob,
# method = "pearson")
## Perform data driven variable selection
# Build a simple MaxEnt model with the following "default" settings:
# ! Note that we are building the model using the same block CV folds
# - linear, quadratic, product and hinge feature class combinations;
# - regularization multiplier equal to 1;
# - 500 algorithm iterations
#default_mod <- train(method = "Maxent",
# data = swd_ob,
# folds = folds_list)
# 0.70 Pearson correlation threshold
#selected_vars_mod_p <- varSel(default_mod,
# metric = "auc",
# bg4cor = swd_ob,
# method = "pearson",
# cor_th = 0.70,
# env = ClimVars_m,
# permut = 1) # Increase this on actual run
# We can then write our predictor variables as a tif for the modelling script
#SelectedVars <- c(ClimVars_m$bio_2, # Mean Diurnal Range
# ClimVars_m$bio_3, # Isothermality
# ClimVars_m$bio_8, # Mean Temperature of Wettest Quarter
# ClimVars_m$bio_10, # Mean Temperature of Warmest Quarter
# ClimVars_m$bio_13, # Precipitation of Wettest Month
# ClimVars_m$bio_15, # Precipitation Seasonality
# ClimVars_m$bio_18) # Precipitation of Warmest Quarter
#writeRaster(SelectedVars, "C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\SelectedVars.tif")
## Load in the CMIP6 future climate data that we downloaded earlier
futureClim <- rast("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\Future\\climate\\wc2.1_5m\\wc2.1_5m_bioc_HadGEM3-GC31-LL_ssp245_2041-2060.tif")
# Select the climate predictors that we are interested in. In this tutorial, we adopted a data-driven approach using the SDMtune package
# Note the different naming system between current and future!
FutureVars <- c(futureClim$bio02,
futureClim$bio03,
futureClim$bio08,
futureClim$bio10,
futureClim$bio13,
futureClim$bio15,
futureClim$bio18)
names(FutureVars) <- c("bio_2", "bio_3", "bio_8", "bio_10", "bio_13", "bio_15", "bio_18")
# Load current climate data
climate_path <- "C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\wc2.1_5m_baseline"
climate_tifs <- list.files(climate_path, pattern = "\\.tif$", full.names = TRUE) # List all .tif files in the directory
# Load each .tif file as a 'rast' object and store it in a list
climate_list <- lapply(climate_tifs, function(x) {
rast(x)
})
CurrentVars <- rast(climate_list) # Stack the list of SpatRaster elements into a single object with 19 layers
CurrentVars <- c(CurrentVars$wc2.1_5m_bio_2,
CurrentVars$wc2.1_5m_bio_3,
CurrentVars$wc2.1_5m_bio_8,
CurrentVars$wc2.1_5m_bio_10,
CurrentVars$wc2.1_5m_bio_13,
CurrentVars$wc2.1_5m_bio_15,
CurrentVars$wc2.1_5m_bio_18)
names(CurrentVars) <- c("bio_2", "bio_3", "bio_8", "bio_10", "bio_13", "bio_15", "bio_18")
# Download Administrative boundary of UK using gadm
# See gadm project: https://gadm.org/
UK_Poly <- gadm(country = "GBR", # Three-letter ISO code or full country name
level = 0, # The level of administrative subdivision requested (0 = country, 1 = first-level subdivision, etc.)
version = "latest",
resolution = 1, # 1 = high, 2 = low
path = "C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal")
# Buffer by 5km to preserve coastal cells
UK_Poly_buff <- terra::buffer(UK_Poly,
width = 5000)
# Mask the current and future climate data
curr_crop <- crop(CurrentVars, UK_Poly_buff)
CurrentMasked <- mask(curr_crop, UK_Poly_buff)
fut_crop <- crop(FutureVars, UK_Poly_buff)
FutureMasked <- mask(fut_crop, UK_Poly_buff)
# Before writing, rename the future layers to match the naming system used to fit the biomod2 ENMs
names(FutureMasked) <- c("bio_2", "bio_3", "bio_8", "bio_10", "bio_13", "bio_15", "bio_18")
#writeRaster(CurrentMasked, "C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\ClimVars_UK\\SelectedVars_Curr_UK.tif")
#writeRaster(FutureMasked, "C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\ClimVars_UK\\SelectedVars_Fut_UK.tif")
Acevedo, P., Jiménez-Valverde, A., Lobo, J.M. and Real, R., (2012) Delimiting the geographical background in species distribution modelling. Journal of Biogeography 39, 1383–1390.
Barbet‐Massin, M., Jiguet, F., Albert, C.H. and Thuiller, W., (2012) Selecting pseudo‐absences for species distribution models: How, where and how many?. Methods in ecology and evolution 3, 327-338.
Barve, N., Barve, V., Jiménez-Valverde, A., Lira-Noriega, A., Maher, S.P., Peterson, A.T., Soberón, J. and Villalobos, F., (2012) The crucial role of the accessible area in ecological niche modeling and species distribution modeling. Ecological Modelling 222, 1810–1819.
Brandt, L.A., Benscoter, A.M., Harvey, R., Speroterra, C., Bucklin, D., Romañach, S.S., Watling, J.I. and Mazzotti, F.J., (2017) Comparison of climate envelope models developed using expert-selected variables versus statistical selection. Ecological Modelling 345, 10-20.
Dormann, C.F., Elith, J., Bacher, S., Buchmann, C., Carl, G., Carré, G., Marquéz, J.R.G., Gruber, B., Lafourcade, B., Leitão, P.J. and Münkemüller, T., (2013) Collinearity: a review of methods to deal with it and a simulation study evaluating their performance. Ecography 36, 27-46.
Hällfors, M. H., J. Liao, J. Dzurisin, R. Grundel, M. Hyvärinen, K. Towle, G. C. Wu, and J. J. Hellmann. (2016) Addressing potential local adaptation in species distribution models: Implications for conservation under climate change. Ecological Applications 26, 1154–1169.
Hausfather, Z., K. Marvel, G. A. Schmidt, J. W. Nielsen-Gammon, and M. Zelinka. (2022) Climate simulations: recognize the ‘hot model’ problem. Nature 605, 26–29.
Lange, S. (2021) ISIMIP3b bias adjustment fact sheet. Potsdam: Inter-Sectoral Impact Model Intercomparison Project.
Muscarella, R., Galante, P.J., Soley‐Guardia, M., Boria, R.A., Kass, J.M., Uriarte, M. and Anderson, R.P., (2014) ENM eval: An R package for conducting spatially independent evaluations and estimating optimal model complexity for Maxent ecological niche models. Methods in ecology and evolution 5, 1198-1205.
Raes, N. 2012. Partial versus full species distribution models. Natureza a Conservacao 10, 127–138.
Roberts, D.R., Bahn, V., Ciuti, S., Boyce, M.S., Elith, J., Guillera‐Arroita, G., Hauenstein, S., Lahoz‐Monfort, J.J., Schröder, B., Thuiller, W. and Warton, D.I., (2017). Cross‐validation strategies for data with temporal, spatial, hierarchical, or phylogenetic structure. Ecography 40, 913-929.
Sánchez-Fernández, D., J. M. Lobo, and O. L. Hernández-Manrique. (2011) Species distribution models that do not incorporate global data misrepresent potential distributions: A case study using Iberian diving beetles. Diversity and Distributions 17, 163–171.
Sillero, N., S. Arenas-Castro, U. Enriquez-Urzelai, C. G. Vale, D. Sousa-Guedes, F. Martínez-Freiría, R. Real, and A. M. Barbosa. (2021) Want to model a species niche? A step-by-step guideline on correlative ecological niche modelling. Ecological Modelling 456, e109671.
Smith, A. B., W. Godsoe, F. Rodríguez-Sánchez, H. H. Wang, and D. Warren. (2019) Niche Estimation Above and Below the Species Level. Trends in Ecology and Evolution 34, 260–273.
VanDerWal, J., Shoo, L.P., Graham, C. and Williams, S.E., (2009) Selecting pseudo-absence data for presence-only distribution modeling: How far should you stray from what you know? Ecological Modelling 220, 589–594.
Vignali, S., Barras, A.G., Arlettaz, R. and Braunisch, V., 2020. SDMtune: An R package to tune and evaluate species distribution models. Ecology and Evolution 10, 11488-11506.
Wenger, S.J. and Olden, J.D., (2012) Assessing transferability of ecological models: an underappreciated aspect of statistical validation. Methods in Ecology and Evolution, 3, 260-267.
In order to build our ENM for P. brevilinea and project it across the UK to search for potential assisted colonisation sites, we are going to use the biomod2 R package. biomod2 provides functionality for ecological niche modelling, calibration and evaluation, ensembles of models, ensemble forecasting and visualisation. It is under constant development, so be aware of potential updates since the publication of this tutorial. The code below is based on package version 4.2-6-2.
library(biomod2)
library(dismo)
library(terra)
setwd("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal")
## Load in the required files
# PATable
PATable <- read.csv("./PATable_CV/PATable.csv", header = TRUE)
# Cross-validation block partitions
cv_blocks <- read.csv("./PATable_CV/cv_blocks.csv", header = TRUE)
blocks_PA <- cv_blocks[, 2:5] # subset to just the four fold columns (the first column is the row index added by write.csv)
colnames(blocks_PA) <- c("_PA1_RUN1", "_PA1_RUN2", "_PA1_RUN3", "_PA1_RUN4")
# Load in the selected climate variables masked to the full background modelling extent (i.e., across occupied ecoregions)
ModClimVars <- rast("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\ClimVars\\SelectedVars.tif")
# Load in the selected clim vars masked to projection extent (UK)
#Curr_UK <- rast("./ClimVars/ClimVars_UK/SelectedVars_Curr_UK.tif") # Baseline
Curr_UK <- CurrentMasked
#Fut_UK <- rast("./ClimVars/ClimVars_UK/SelectedVars_Fut_UK.tif") # Future
Fut_UK <- FutureMasked
This first step involves structuring the observation and predictor variable data in the format required by biomod2.
## Format the Presence/(pseudo)absence data and naming in preparation for modelling
# Abbreviated name of your species
ModelName <- "P.brevilinea"
## Format data for biomod
myBiomodData <- BIOMOD_FormatingData(resp.var= as.numeric(PATable$Binary), # vector of presence
expl.var = ModClimVars, # stack with all raster (predictors)
resp.xy = PATable[,c("x","y")], # x and y of presence and absence locations
resp.name= ModelName, # Name of the species
PA.nb.rep = 1, # Number of pseudo-absence replicates
PA.strategy = 'user.defined', ### strategy for Pseudoabsences selection.
PA.user.table = data.frame(PATable$PA1),
na.rm=F) # There shouldn't be any NAs based on the cleaning/pseudo-absence precautions
##
## -=-=-=-=-=-=-=-=-=-=-=-=-= P.brevilinea Data Formating -=-=-=-=-=-=-=-=-=-=-=-=-=
##
##
## Checking Pseudo-absence selection arguments...
##
## > User defined pseudo absences selection
##
## ! No data has been set aside for modeling evaluation
## -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= Done -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
# Print the formated data
myBiomodData
##
## -=-=-=-=-=-=-=-=-=-=-=-=-=-= BIOMOD.formated.data -=-=-=-=-=-=-=-=-=-=-=-=-=-=
##
## dir.name = .
##
## sp.name = P.brevilinea
##
## 78 presences, 0 true absences and 10000 undefined points in dataset
##
##
## 7 explanatory variables
##
## bio_2 bio_3 bio_8 bio_10
## Min. : 3.290 Min. :17.41 Min. : 1.113 Min. :10.45
## 1st Qu.: 6.990 1st Qu.:25.90 1st Qu.: 7.358 1st Qu.:15.23
## Median : 7.638 Median :29.13 Median :13.128 Median :15.93
## Mean : 7.523 Mean :30.32 Mean :11.640 Mean :15.96
## 3rd Qu.: 8.110 3rd Qu.:35.08 3rd Qu.:15.910 3rd Qu.:16.58
## Max. :10.933 Max. :44.95 Max. :18.210 Max. :21.09
## bio_13 bio_15 bio_18
## Min. : 49 Min. : 8.873 Min. :111.0
## 1st Qu.: 72 1st Qu.:19.673 1st Qu.:178.0
## Median : 79 Median :26.440 Median :204.0
## Mean : 85 Mean :25.135 Mean :202.8
## 3rd Qu.: 85 3rd Qu.:30.769 3rd Qu.:221.0
## Max. :333 Max. :41.596 Max. :464.0
##
##
## 1 Pseudo Absences dataset available ( PA1 ) with 10000 (PA1)
## pseudo absences
##
## -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
In this step, we choose which algorithms we want to calibrate on the data. At the time of writing, there are 12 different modelling algorithms that can be implemented within the biomod2 package, including a range of regression (e.g., generalised linear models and generalised additive models) and machine learning techniques (e.g., random forests, generalised boosted models and MaxEnt). biomod2 acts as a wrapper that calls modelling functions from various external packages. It automatically retrieves the modelling options (arguments) from those packages, making it possible to access and modify some of the settings supported by each algorithm.
Here, we are going to focus on two regression techniques (generalised linear models and generalised additive models) and a machine learning technique (MaxEnt). Rather than using the default settings for these algorithms, we will use the settings provided by the biomod2 team, which have been modified to make them more appropriate for ecological niche modelling. You can print “OptionsBigboss” to view the modified settings for each algorithm. For generalised linear models (GLMs) and generalised additive models (GAMs), the modified settings focus on simplifying and speeding up the models (note that we are accessing the gam function from the mgcv package, not the gam package, which is also accessible through biomod2). We recommend reading this tutorial on the biomod2 GitHub page to learn more about the different modelling options and the tuning functionality, which allows the user to optimise model settings specifically for their data.
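For example (a sketch; the OptionsBigboss object ships with recent biomod2 versions and its internal structure may differ between releases):
# Print the preset 'bigboss' modelling options shipped with biomod2 (already loaded above)
OptionsBigboss
# str(OptionsBigboss, max.level = 2) # optionally explore the underlying structure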
The predictive performance of ENMs can be validated by three components: classification capacity, discrimination capacity and calibration (Sillero et al., 2021). Classification capacity indicates the ability of a model to correctly classify occupied sites as suitable (i.e., species is present) and unoccupied sites as unsuitable (species is absent), based on a threshold value. We will use the true skill statistic (TSS) as a measure of classification capacity, which is equal to Sensitivity (proportion of correctly classified presences) + Specificity (proportion of correctly classified absences, or pseudo-absences/background points in our case) – 1. TSS ranges from −1 to 1, with 0 corresponding to random classification. Discrimination capacity provides information on how well a model distinguishes between occupied and unoccupied sites. To assess discrimination, we will use the area under the curve (AUC) of the receiver operating characteristic (ROC), which plots Sensitivity against 1-Specificity, and does not require a threshold value. AUC ranges from 0 to 1, where 0.5 indicates a model performing no better than random and 1 represents perfect discrimination between occupied and unoccupied. Goodness of calibration measures the model’s ability to predict higher suitability in environmental conditions where a species is more frequently observed. For example, in a well-calibrated model, pixels with predicted values of 0.3 versus 0.9 would be occupied by the species 30% versus 90% of the time, respectively. To measure calibration, we will use the continuous Boyce index (BOYCE), which ranges from –1 to 1, where more positive values suggest that model predictions match well with the presence data, and negative values suggest a poor match.
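As a quick numerical illustration of the classification and discrimination metrics (a toy example with made-up values, separate from the moth workflow):
obs  <- c(1, 1, 1, 0, 0, 0, 0, 0)                   # 1 = presence, 0 = (pseudo-)absence
pred <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05)  # predicted suitability values
# Classification capacity at a chosen threshold (here 0.5)
thr  <- 0.5
sens <- mean(pred[obs == 1] >= thr) # proportion of presences correctly classified
spec <- mean(pred[obs == 0] < thr)  # proportion of absences correctly classified
TSS  <- sens + spec - 1
# Discrimination capacity (AUC): the probability that a randomly chosen presence
# receives a higher predicted value than a randomly chosen absence
gt   <- outer(pred[obs == 1], pred[obs == 0], ">")
ties <- outer(pred[obs == 1], pred[obs == 0], "==")
AUC  <- mean(gt + 0.5 * ties)
round(c(sensitivity = sens, specificity = spec, TSS = TSS, AUC = AUC), 3)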
myBiomodModelOut <- suppressWarnings(suppressMessages(BIOMOD_Modeling(bm.format = myBiomodData,
models = c('GAM', 'GLM', 'MAXENT'), # Selected modelling algorithms
CV.strategy='user.defined', # Cross-validation method
CV.user.table = blocks_PA, # User-defined calibration/validation folds (our spatial blocks)
CV.do.full.models = FALSE, # Do not also calibrate/validate models on the full dataset
#CV.nb.rep=0,
#CV.perc=NULL,
OPT.strategy = 'bigboss', # method to select the algorithm parameter values
var.import=5, # Number of permutations for calculating variable importance
metric.eval= c('TSS', 'ROC', 'BOYCE'), # Selected metrics for evaluation
do.progress = FALSE)))
## Get variable importance in each run
myModelsVarImport<-get_variables_importance(myBiomodModelOut) # importance of explanatory variables in each run
#write.csv(myModelsVarImport, file=paste0("./Modelling/Variable_importance/VarImp_by_run_", ModelName,".csv"))
## Get evaluation scores in each run
myBiomodModelEval <- get_evaluations(myBiomodModelOut) # get evaluation scores for each run
#write.csv(myBiomodModelEval, file=paste0("./Modelling/Model_Evaluation/byRun_", ModelName,".csv"))
In this step, we combine the models from the previous step in an ensemble model to reach a consensus using the weighted mean prediction. This approach takes into account the model evaluation scores when combining models and has been shown to provide more robust predictions than most other consensus methods (Marmion et al., 2009; Hao et al. 2019). We also exclude models with poor discrimination capacity using an AUC (ROC) threshold of 0.7.
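Conceptually, the evaluation-weighted ensemble works like the toy sketch below (hypothetical AUC scores and predictions; the BIOMOD_EnsembleModeling call that follows handles the weighting internally):
auc_scores <- c(GLM = 0.82, GAM = 0.88, MAXENT = 0.65) # hypothetical per-model AUC scores
preds      <- c(GLM = 0.40, GAM = 0.55, MAXENT = 0.90) # hypothetical predictions for a single cell
keep    <- auc_scores >= 0.7                        # drop models below the AUC threshold
weights <- auc_scores[keep] / sum(auc_scores[keep]) # weights proportional to evaluation scores
sum(weights * preds[keep])                          # weighted-mean ensemble prediction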
myBiomodEM_all <- suppressWarnings(suppressMessages(BIOMOD_EnsembleModeling(bm.mod = myBiomodModelOut,
models.chosen = 'all',
em.by = 'all',
em.algo = c('EMwmean'), # Method for averaging predicted likelihood of occurrence
metric.select = c('ROC'), # Evaluation method for weighting models
metric.select.thresh = c(0.7), # AUC threshold for excluding poor performing models from the ensemble
metric.eval = c('TSS', 'ROC', 'BOYCE'),
var.import = 5, # Number of random permutations for calculating variable importance
EMwmean.decay = 'proportional',
do.progress = FALSE)))
## Get variable importance in the ensemble
myEM_all_VarImport <-get_variables_importance(myBiomodEM_all) # importance of explanatory variables in the ensemble
#write.csv(myEM_all_VarImport, file=paste0("./Modelling/Variable_importance/VarImp_ensemble_", ModelName,".csv"))
# Plot the variable importance scores for the ensemble model
VarImpPlot <- bm_PlotVarImpBoxplot(bm.out = myBiomodEM_all, group.by = c("expl.var", "algo", "full.name"))
#VarImpPlot$plot
## Get evaluation scores in the ensemble
myBiomodEMEval_all<- get_evaluations(myBiomodEM_all) #get evaluation (calibration) scores for the full ensemble model ('all')
myBiomodEMEval_all[,7:11]
## metric.eval cutoff sensitivity specificity calibration
## 1 TSS 543.0 85.897 89.95 0.759
## 2 ROC 547.5 85.897 90.21 0.929
## 3 BOYCE 551.0 84.615 90.42 0.919
#write.csv(myBiomodEMEval_all, file=paste0("./Modelling/Model_Evaluation/Ensemble_", ModelName,".csv"))
The variable importance plot suggests that the ensemble model is primarily driven by bio2 (Mean Diurnal Range) and bio10 (Mean Temperature of Warmest Quarter). The TSS, ROC and BOYCE evaluation scores indicate that the ensemble model has excellent classification accuracy, discrimination capacity and calibration performance. With these strong results, we can confidently move on to the next phase: projecting the ensemble model spatially.
Now we project the models spatially across our region(s) of interest.
## First, we can project the models across the entire modelling extent (i.e., across occupied ecoregions)
# Project individual models first
myBiomodProj_ecoreg <- suppressWarnings(suppressMessages(BIOMOD_Projection(bm.mod = myBiomodModelOut,
proj.name = 'ModelOutProj_ecoreg',
new.env = ModClimVars, # Projection across the full modelling extent (occupied ecoregions)
models.chosen = 'all',
metric.binary = 'all',
build.clamping.mask = TRUE))) # Produces raster showing n vars p/pixel that are out of their calib/valid range
# Ensemble forecast onto occupied ecoregions
myBiomodEF_all_ecoreg <- suppressWarnings(suppressMessages(BIOMOD_EnsembleForecasting(bm.em = myBiomodEM_all,
bm.proj = myBiomodProj_ecoreg,
proj.name = 'all_EnsembleProj_ecoreg',
models.chosen = 'all')))
myProjEM_all_ecoreg <-get_predictions(myBiomodEF_all_ecoreg) # Get the map
plot(myProjEM_all_ecoreg, main = "Predicted suitability across modelling extent")
## Now project across United Kingdom (i.e. the area where we are searching for potential AC sites)
#### Current/baseline climate ####
myBiomodProj_UK_Curr <- suppressWarnings(suppressMessages(BIOMOD_Projection(bm.mod = myBiomodModelOut,
proj.name = 'ModelOutProj_UK_Curr',
new.env = Curr_UK, # Baseline data across UK
models.chosen = 'all',
metric.binary = 'all',
build.clamping.mask = TRUE))) # Produces raster showing n vars in each pixel that are out of their calib/valid range
# Ensemble forecast onto UK
myBiomodEF_all_UK_Curr <- suppressWarnings(suppressMessages(BIOMOD_EnsembleForecasting(bm.em = myBiomodEM_all,
bm.proj = myBiomodProj_UK_Curr,
proj.name = 'all_EnsembleProj_UK_Curr',
models.chosen = 'all')))
myProjEM_all_UK_Curr <-get_predictions(myBiomodEF_all_UK_Curr) # Get the map
#### Future climate ####
myBiomodProj_UK_Fut <- suppressWarnings(suppressMessages(BIOMOD_Projection(bm.mod = myBiomodModelOut,
proj.name = 'ModelOutProj_UK_Fut',
new.env = Fut_UK, # Future data across UK
models.chosen = 'all',
metric.binary = 'all',
build.clamping.mask = TRUE))) # Produces raster showing n vars in each pixel that are out of their calib/valid range
# Ensemble forecast onto UK
myBiomodEF_all_UK_Fut <- suppressWarnings(suppressMessages(BIOMOD_EnsembleForecasting(bm.em = myBiomodEM_all,
bm.proj = myBiomodProj_UK_Fut,
proj.name = 'all_EnsembleProj_UK_Fut',
models.chosen = 'all')))
myProjEM_all_UK_Fut <-get_predictions(myBiomodEF_all_UK_Fut) # Get the map
# Plot current and future suitability side by side
par(mfrow = c(1, 2)) # Set layout to 1 row, 2 columns
plot(myProjEM_all_UK_Curr, main = "Current suitability")
plot(myProjEM_all_UK_Fut, main = "Future suitability (2041-2060)")
Both continuous and binary (suitable/unsuitable) predictions of suitability are useful for selecting recipient sites. When using pseudo-absences rather than true absences, one of the most reliable statistical approaches for converting continuous predictions to binary is to select the threshold value that maximises the TSS score. This approach has consistently been shown to produce more accurate outputs in modelling exercises where presences and pseudo-absences were used (Liu et al., 2005; 2013).
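To make the idea concrete, the sketch below scans candidate thresholds for some toy presence/absence data and picks the one that maximises TSS (for the real workflow, biomod2 already reports this value as the TSS "cutoff" in the ensemble evaluation table, which is what we use below):
obs  <- c(1, 1, 1, 0, 0, 0, 0, 0)
pred <- c(0.9, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05)
thresholds <- sort(unique(pred))
tss <- sapply(thresholds, function(t) {
  mean(pred[obs == 1] >= t) + mean(pred[obs == 0] < t) - 1
})
thresholds[which.max(tss)] # threshold value that maximises TSS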
ecoreg_bin <- bm_BinaryTransformation(data = myProjEM_all_ecoreg[[1]],
threshold = myBiomodEMEval_all[1, 8]) # We want the "cutoff" column value for TSS
UK_curr_bin <- bm_BinaryTransformation(data = myProjEM_all_UK_Curr[[1]],
threshold = myBiomodEMEval_all[1, 8])
UK_fut_bin <- bm_BinaryTransformation(data = myProjEM_all_UK_Fut[[1]],
threshold = myBiomodEMEval_all[1, 8])
par(mfrow = c(1, 2))
plot(UK_curr_bin, main = "Current suitability (binary)", cex.main = 0.8)
plot(UK_fut_bin, main = "Future suitability (2041-2060) (binary)", cex.main = 0.8)
#writeRaster(myProjEM_all_ecoreg[[1]], "./Modelling/Spatial_Projections/Continuous_Proj_Ecoregion.tif")
#writeRaster(myProjEM_all_UK_Curr[[1]], "./Modelling/Spatial_Projections/Continuous_Proj_UK_Current.tif")
#writeRaster(myProjEM_all_UK_Fut[[1]], "./Modelling/Spatial_Projections/Continuous_Proj_UK_245_HadGEM_41-60.tif")
#writeRaster(ecoreg_bin, "./Modelling/Spatial_Projections/Binary_Proj_Ecoregion.tif", overwrite = T)
#writeRaster(UK_curr_bin, "./Modelling/Spatial_Projections/Binary_Proj_UK_Current.tif")
#writeRaster(UK_fut_bin, "./Modelling/Spatial_Projections/Binary_Proj_245_HadGEM_41-60.tif", )
It is important to quantify potential extrapolation when transferring model predictions across space and time. The ability to effectively plan a translocation using future projections from ENMs relies on the transparency and documentation of the modelling approach. A popular method for assessing the presence of non-analogue climates in the projection data is the multivariate environmental similarity surface (MESS) analysis (Elith et al. 2010), which assesses the degree of similarity between data used to calibrate the model and projection data. It can be useful to set a threshold which defines a point at which ENM predictions are too uncertain and likely no longer useful (termed “Forecast Proficiency Threshold”) for assessing the future suitability of recipient sites (Petchey et al. 2015).
## Compute MESS analysis to calculate similarity between calibration and projection data (i.e. all pixels across ecoregions for current, and UK for current and future)
referenceValues <- terra::extract(x = ModClimVars, y = PATable[,c("x","y")], method = 'simple', ID = F)
## MESS for CURRENT climate across ecoregion extent
ModClimVars_brick <- brick(ModClimVars) # Convert to RasterBrick required by dismo::mess
Mess_Curr_ecoreg <- suppressWarnings(suppressMessages(mess(x = ModClimVars_brick, # climate rasters
v = referenceValues, # climate data used to fit the model (i.e. at the presence and pseudo-absence locations)
full = FALSE))) # FALSE returns a single layer of MESS values rather than one layer per input variable plus an additional MESS layer
## MESS for CURRENT climate across UK extent
Curr_UK_brick <- brick(Curr_UK) # Convert to RasterBrick required by dismo::mess
Mess_Curr_UK <- suppressWarnings(suppressMessages(mess(x = Curr_UK_brick,
v = referenceValues,
full = FALSE)))
## MESS for FUTURE climate across UK extent
Fut_UK_brick <- brick(Fut_UK) # Convert to RasterBrick required by dismo::mess
Mess_Fut_UK <- suppressWarnings(suppressMessages(mess(x = Fut_UK_brick,
v = referenceValues,
full = FALSE)))
## Set all cells with MESS values > 0 to NA, leaving only the non-analogue cells (MESS values below 0 indicate conditions outside the calibration range)
non_analog_clims <- Mess_Fut_UK
non_analog_clims[non_analog_clims > 0] <- NA
# Plot
par(mfrow = c(1, 2))
plot(Mess_Fut_UK, main = "Full MESS output")
plot(non_analog_clims, main = "Non-analogue climates")
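When non-analogue climates are present, one option is to mask the ensemble predictions in those cells so that downstream site selection only considers areas where the model is interpolating. The lines below are a minimal sketch of this idea rather than part of the original workflow; mess_thresh is a hypothetical forecast proficiency threshold, set here to 0 so that any non-analogue cell is removed.
# Sketch: mask future ensemble predictions wherever MESS indicates non-analogue climate
mess_thresh <- 0 # Hypothetical forecast proficiency threshold (0 = flag any non-analogue cell)
Mess_Fut_UK_terra <- terra::rast(Mess_Fut_UK) # Convert the raster-package MESS layer to a SpatRaster
Fut_masked <- terra::ifel(Mess_Fut_UK_terra < mess_thresh, NA, myProjEM_all_UK_Fut[[1]])
plot(Fut_masked, main = "Future suitability (non-analogue cells masked)")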
# Write MESS outputs
#writeRaster(Mess_Curr_ecoreg, "./Modelling/MESS_Outputs/Mess_Curr_ecoreg.tif")
#writeRaster(Mess_Curr_UK, "./Modelling/MESS_Outputs/Mess_Curr_UK.tif")
#writeRaster(Mess_Fut_UK, "./Modelling/MESS_Outputs/Mess_Fut_UK_245_HadGEM_41-60.tif")
Elith, J., Kearney, M. and Phillips, S. (2010). The art of modelling range-shifting species. Methods in Ecology and Evolution, 1, 330-342.
Hao, T., Elith, J., Guillera-Arroita, G. and Lahoz-Monfort, J.J. (2019). A review of evidence about use and performance of species distribution modelling ensembles like BIOMOD. Diversity and Distributions, 25, 839-852.
Liu, C., Berry, P.M., Dawson, T.P. and Pearson, R.G. (2005). Selecting thresholds of occurrence in the prediction of species distributions. Ecography, 28, 385-393.
Liu, C., White, M. and Newell, G. (2013). Selecting thresholds for the prediction of species occurrence with presence-only data. Journal of Biogeography, 40, 778-789.
Marmion, M., Parviainen, M., Luoto, M., Heikkinen, R.K. and Thuiller, W. (2009). Evaluation of consensus methods in predictive species distribution modelling. Diversity and Distributions, 15, 59-69.
Petchey, O.L., Pontarp, M., Massie, T.M., Kéfi, S., Ozgul, A., Weilenmann, M., Palamara, G.M., Altermatt, F., Matthews, B., Levine, J.M. and Childs, D.Z. (2015). The ecological forecast horizon, and examples of its uses and determinants. Ecology Letters, 18, 597-611.
In this section, we demonstrate how to use the outputs generated from the ENMs to identify candidate sites for assisted colonisation. We do this by subsetting a UK land cover dataset (CORINE Land Cover, CLC) to polygons that provide suitable habitat for our focal species and are projected to retain their climatic suitability up to at least 2050.
library(dplyr)
library(mapview)
library(sf)
library(terra)
# Read in the land cover dataset. We use CORINE here because of its small file size, but LivingEngland may be a more up-to-date choice
CORINE <- st_read("C:\\Users\\Joe\\OneDrive - Liverpool John Moores University\\ESRT JB Personal\\CORINE_LUC_2018\\dat\\data\\clc2018_uk.shp")
## Reading layer `clc2018_uk' from data source
## `C:\Users\Joe\OneDrive - Liverpool John Moores University\ESRT JB Personal\CORINE_LUC_2018\dat\data\clc2018_uk.shp'
## using driver `ESRI Shapefile'
## replacing null geometries with empty geometries
## Simple feature collection with 78880 features and 10 fields (with 3393 geometries empty)
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -1142.646 ymin: -106907 xmax: 680645.2 ymax: 1243630
## Projected CRS: OSGB36 / British National Grid
# Subset the land cover dataset to classes that may provide habitat for the species, and add a column of habitat descriptions.
CORINE_P_brev <- CORINE %>%
filter(CODE_18 %in% c("411", "421", "422", "423", "511", "512", "521")) %>%
mutate(Hab_Desc = case_when(
CODE_18 == "411" ~ "Inland marshes",
CODE_18 == "421" ~ "Salt marshes",
CODE_18 == "422" ~ "Salines",
CODE_18 == "423" ~ "Intertidal flats",
CODE_18 == "511" ~ "Water courses",
CODE_18 == "512" ~ "Water bodies",
CODE_18 == "521" ~ "Coastal lagoons"
))
# Transform to EPSG 4326 to match the raster data (transforming vector data is far more
# straightforward and accurate than reprojecting raster data)
CORINE_P_brev_t <- st_transform(CORINE_P_brev, 4326)
# Next, we can remove polygons that are outside of current & future climate envelope
# Extract values of raster cells that overlap with polygons
vals_curr <- extract(UK_curr_bin, vect(CORINE_P_brev_t), touches = TRUE)
vals_fut <- extract(UK_fut_bin, vect(CORINE_P_brev_t), touches = TRUE)
# Group the values by polygon ID
vals_by_poly_curr <- split(vals_curr[[2]], vals_curr$ID)
vals_by_poly_fut <- split(vals_fut[[2]], vals_fut$ID)
# Add them as list-columns to the CORINE sf object
CORINE_P_brev_t$curr_vals <- vals_by_poly_curr[as.character(seq_len(nrow(CORINE_P_brev_t)))]
CORINE_P_brev_t$fut_vals <- vals_by_poly_fut[as.character(seq_len(nrow(CORINE_P_brev_t)))]
# Remove rows where 0 or NA exist, i.e., we only retain polygons/rows where climate is suitable NOW and in the FUTURE
CORINE_Suitable <- CORINE_P_brev_t %>%
filter(
!sapply(curr_vals, function(x) any(is.na(x) | x == 0)),
!sapply(fut_vals, function(x) any(is.na(x) | x == 0))
)
# Remove the two list columns
CORINE_Suitable$curr_vals <- NULL
CORINE_Suitable$fut_vals <- NULL
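As an optional check (not part of the original workflow), we can quickly tabulate how many candidate polygons remain in each habitat class before attaching the suitability values.
# Sketch: count the remaining candidate polygons per habitat class
CORINE_Suitable %>%
  st_drop_geometry() %>%
  count(Hab_Desc, sort = TRUE)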
# Next, let's add the continuous suitability values to the polygons
curr_suit <- extract(myProjEM_all_UK_Curr[[1]], vect(CORINE_Suitable), touches = TRUE)
fut_suit <- extract(myProjEM_all_UK_Fut[[1]], vect(CORINE_Suitable), touches = TRUE)
# Group the values by polygon ID
vals_by_poly_currCont <- split(curr_suit[[2]], curr_suit$ID)
vals_by_poly_futCont <- split(fut_suit[[2]], fut_suit$ID)
# Add the suitability columns and convert them from list-columns to comma-separated character vectors
CORINE_Suitable$Suitability_curr <- vals_by_poly_currCont[as.character(seq_len(nrow(CORINE_Suitable)))]
CORINE_Suitable$Suitability_fut <- vals_by_poly_futCont[as.character(seq_len(nrow(CORINE_Suitable)))]
CORINE_Suitable$Suitability_curr <- sapply(CORINE_Suitable$Suitability_curr, function(x) paste(x, collapse = ","))
CORINE_Suitable$Suitability_fut <- sapply(CORINE_Suitable$Suitability_fut, function(x) paste(x, collapse = ","))
CORINE_Suitable[,12:14]
## Simple feature collection with 143 features and 3 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -2.661113 ymin: 50.61515 xmax: 1.716885 ymax: 53.15052
## Geodetic CRS: WGS 84
## First 10 features:
## Hab_Desc Suitability_curr Suitability_fut
## 1 Inland marshes 545 670
## 2 Inland marshes 595 624
## 3 Inland marshes 595 624
## 4 Inland marshes 549 654
## 5 Inland marshes 549 654
## 6 Inland marshes 549 654
## 7 Inland marshes 575 667
## 8 Inland marshes 555,599 665,669
## 9 Inland marshes 610 667
## 10 Inland marshes 625 665
## geometry
## 1 POLYGON ((-2.463776 50.6258...
## 2 POLYGON ((-1.566274 50.7344...
## 3 POLYGON ((-1.531361 50.7365...
## 4 POLYGON ((0.301632 50.85176...
## 5 POLYGON ((0.3220615 50.8626...
## 6 POLYGON ((0.3006311 50.8718...
## 7 POLYGON ((0.8445754 50.9665...
## 8 POLYGON ((1.203402 51.32181...
## 9 POLYGON ((0.8792845 51.3871...
## 10 POLYGON ((0.6980003 51.3931...
The table above shows the current and projected future suitability values for the first 10 patches of habitat expected to remain climatically suitable up to at least 2050 (note that some rows contain multiple suitability values because those polygons overlap more than one raster cell). We can also visualise the areas of potentially suitable habitat for an assisted colonisation on an interactive map.
mapview(CORINE_Suitable)
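Finally, if the shortlisted polygons need to be inspected in a desktop GIS or shared with project partners, they can be written out as a spatial file. The call below is a sketch and is commented out to mirror the earlier write steps; the file path is only a placeholder.
# Sketch: export the shortlisted candidate polygons (placeholder path)
#st_write(CORINE_Suitable, "./Modelling/Site_Selection/CORINE_Candidate_Sites.gpkg", delete_dsn = TRUE)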